Construction of Mizo – English Parallel Corpus for Machine Translation

Abstract

A parallel corpus is a key component of statistical and Neural Machine Translation (NMT). While most research focuses on machine translation itself, corpus creation studies are limited for many languages, and no research paper on Mizo–English corpus creation exists yet. A high-quality parallel corpus is required for Natural Language Processing (NLP) activities including chatbots, transliteration, and cross-language information retrieval. This work aims to investigate parallel corpus creation techniques and apply them to the Mizo–English language pair. Another goal is to test machine translation on the newly constructed corpus. We contributed to the LF Aligner tool to support Mizo sentence alignment in corpus development. Our effort created the first large-scale Mizo–English parallel corpus, with over 529K sentences. The pre-processed corpus was used for Mizo-to-English NMT, which was evaluated using BLEU, ChrF, and TER scores. Our system achieved BLEU 45.08, ChrF 65.36, and TER 41.16, setting a new benchmark for Mizo-to-English translation.
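To illustrate the kind of score reported above, here is a minimal sketch of a simplified character n-gram F-score in the spirit of ChrF. It is not the authors' evaluation code and not the official sacreBLEU implementation: the function names `char_ngrams` and `simple_chrf` are hypothetical, whitespace is simply dropped, and the n-gram orders are averaged uniformly, so its numbers should not be expected to match published ChrF values.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified ChrF-style score on a 0-100 scale (illustrative only).

    Precision and recall are averaged over the n-gram orders for which both
    strings have at least one n-gram, then combined into an F-beta score.
    beta=2 weights recall more heavily than precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Hypothetical English outputs, not taken from the Mizo-English corpus.
    print(simple_chrf("the cat sat on the mat", "the cat sat on a mat"))
```

Because the score works on characters rather than whole words, a hypothesis that differs from the reference only in an affix still receives partial credit, which is one reason character-level metrics are often reported alongside BLEU for morphologically rich or low-resource languages.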

Similar resources

Catalan-English statistical machine translation without a parallel corpus

This paper presents a full experiment on large-vocabulary Catalan-English statistical machine translation without an English-Catalan parallel corpus, in the context of the debates of the European Parliament. For this, we make use of an English-Spanish European Parliament Proceedings parallel corpus and a Spanish-Catalan general newspaper parallel corpus, both of more than 30 M words. G...

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation

Parallel corpus is a valuable resource for cross-language information retrieval and data-driven natural language processing systems, especially for Statistical Machine Translation (SMT). However, most existing parallel corpora to Chinese are subject to in-house use, while others are domain specific and limited in size. To a certain degree, this limits the SMT research. This paper describes the ...

Automatic Construction of Translation Knowledge for Corpus-based Machine Translation

Many machine translation (MT) systems that utilize the knowledge automatically acquired from bilingual corpora have been proposed in conjunction with efforts to accumulate corpora. We call this approach corpus-based machine translation in this thesis. This thesis focuses on automatic construction of the translation knowledge needed for corpus-based MT and discusses the following three tasks. 1....

A Corpus-Based Study of Units of Translation in English-Persian Literary Translation

No abstract available.

HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation

∗Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics; †Charles University in Prague, Faculty of Arts, Department of Linguistics; ‡Natural Language Processing Centre, Faculty of Informatics, Masaryk University. Abstract: We present HindEnCorp, a parallel corp...

Journal

Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2023

ISSN: 2375-4699, 2375-4702

DOI: https://doi.org/10.1145/3610404